Hardware-Efficient Attention for Accelerated LLM Decoding
GTA and GLA for Scalable, Low-Latency Language Model Inference
Authors: T. Zadouri et al.
Published on arXiv: 2025-05-27
Link: http://arxiv.org/abs/2505.21487v1
Institutions: Department of Computer Science, Princeton University • Princeton Language and Intelligence, Princeton University
Keywords: hardware-efficient attention, KV cache, arithmetic intensity, Grouped-Tied Attention (GTA), Grouped Latent Attention (GLA), Multi-head Latent Attention (MLA), Grouped-Query Attention (GQA), tensor parallelism, paged KV, FlashAttention3, speculative decoding, latency, throughput, LLM inference, FineWeb-Edu-100B, NVIDIA H100, memory-bound workload
Large Language Model (LLM) decoding is bottlenecked by the memory bandwidth needed to load large key-value (KV) caches, particularly as batch sizes and context lengths grow. Modern inference workloads are frequently memory-bound: moving the KV cache between memory tiers limits how much hardware acceleration and parallelism can help. Standard attention mechanisms such as Multi-Head Attention (MHA) exacerbate the problem because they perform little computation per byte of KV cache loaded, while hardware compute capacity has grown faster than memory bandwidth.
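The memory-bound regime described above can be illustrated with a simple roofline model. The following is a minimal sketch, not taken from the paper: the hardware constants are assumed, illustrative values for an H100-class GPU, and the function names are hypothetical.

```python
# Roofline sketch: attainable throughput is capped by compute or by memory.
# Both hardware constants are illustrative assumptions, not paper figures.
PEAK_FLOPS = 989e12       # assumed dense BF16 peak, in FLOP/s
PEAK_BANDWIDTH = 3.35e12  # assumed HBM bandwidth, in bytes/s


def attainable_flops(intensity: float) -> float:
    """Attainable FLOP/s for a kernel doing `intensity` FLOPs per byte loaded."""
    return min(PEAK_FLOPS, intensity * PEAK_BANDWIDTH)


def is_memory_bound(intensity: float) -> bool:
    """A kernel is memory-bound when its intensity is below the ridge point."""
    return intensity < PEAK_FLOPS / PEAK_BANDWIDTH


# Ridge point: the intensity at which compute and memory limits meet.
ridge = PEAK_FLOPS / PEAK_BANDWIDTH

for intensity in (1, 8, 64, 512):
    share = attainable_flops(intensity) / PEAK_FLOPS
    print(f"intensity {intensity:>4} FLOP/B: {share:6.1%} of peak compute, "
          f"memory-bound={is_memory_bound(intensity)}")
```

Under these assumed numbers, a kernel needs roughly 295 FLOPs per byte before it stops being limited by memory bandwidth, which is why raising the arithmetic intensity of attention matters for decoding.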
To address these challenges, this work introduces solutions designed to boost hardware efficiency while maintaining model quality:
- Two new attention variants are proposed: Grouped-Tied Attention (GTA), which collapses key and value states into a single state per group for improved arithmetic intensity, and Grouped Latent Attention (GLA), which enables parallel-friendly latent attention through head grouping and low-level GPU optimizations.
- Extensive empirical evaluation is conducted on models from 183M to 1.47B parameters, using strong benchmarks and comparisons with existing baselines such as MHA, MQA, GQA, and MLA. System-level optimizations—including software pipelining, warp specialization, distributed offset calculation, and paged KV cache management—are implemented for high processing efficiency on NVIDIA H100 GPUs.
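The cache savings behind these variants can be sketched with back-of-the-envelope arithmetic. The example below is a simplified model under stated assumptions, not the paper's exact configuration: the head counts, head dimension, and latent size are hypothetical, any small separately cached positional component is ignored, and a single latent head is assumed to be replicated on every tensor-parallel device.

```python
def kv_bytes_per_token(kv_heads: int, head_dim: int, tied: bool = False,
                       bytes_per_elem: int = 2) -> int:
    """Bytes cached per token per layer.

    Untied variants store one key and one value vector per KV head;
    a tied variant stores a single shared vector per KV head.
    """
    vectors_per_head = 1 if tied else 2
    return kv_heads * vectors_per_head * head_dim * bytes_per_elem


def latent_bytes_per_device(total_latent_dim: int, latent_heads: int,
                            tp_degree: int, bytes_per_elem: int = 2) -> int:
    """Per-token latent cache bytes held by one tensor-parallel device.

    Assumption: latent heads are spread over devices when there are enough
    of them; otherwise each device keeps a full copy of one head.
    """
    per_head_dim = total_latent_dim // latent_heads
    heads_per_device = max(1, latent_heads // tp_degree)
    return heads_per_device * per_head_dim * bytes_per_elem


HEAD_DIM = 128  # hypothetical head dimension
Q_HEADS = 32    # hypothetical number of query heads

sizes = {
    "MHA": kv_bytes_per_token(Q_HEADS, HEAD_DIM),
    "GQA (8 KV heads)": kv_bytes_per_token(8, HEAD_DIM),
    "MQA": kv_bytes_per_token(1, HEAD_DIM),
    "GTA (8 tied heads)": kv_bytes_per_token(8, HEAD_DIM, tied=True),
}
for name, size in sizes.items():
    print(f"{name:<20} {size:>6} bytes per token per layer")

# One latent head versus two grouped latent heads on 2 devices.
print("single latent:", latent_bytes_per_device(512, 1, tp_degree=2))
print("two latents:  ", latent_bytes_per_device(512, 2, tp_degree=2))
```

In this model, tying keys and values halves the cache relative to a grouped-query layout with the same number of KV heads, and splitting the latent into groups halves what each of two devices must hold.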
The paper reports the following results:
- GTA matches Grouped-Query Attention (GQA) quality while reducing KV cache needs by approximately half and doubling arithmetic intensity.
- GLA maintains Multi-head Latent Attention (MLA) quality and can be up to 2× faster than FlashMLA for speculative decoding scenarios.
- Across multiple model scales, downstream task accuracy is preserved or improved, with the GLA-2 variant achieving higher average accuracy than MLA in large models (e.g., 60.0% vs. 59.1% for the 1.47B XL model).
- Online serving benchmarks indicate that GLA reduces end-to-end latency and increases throughput by up to 2×, with optimized kernels reaching up to 93% of peak memory bandwidth and 70% of peak compute throughput (TFLOPS) on the H100.
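The claim that tying keys and values doubles arithmetic intensity can be checked with a short calculation. This is a rough sketch under simplifying assumptions (only KV cache reads are counted, softmax cost and any separate positional component are ignored); the function name and parameters are hypothetical.

```python
def decode_intensity(queries_per_kv_head: int, tied: bool = False,
                     query_len: int = 1, bytes_per_elem: int = 2) -> float:
    """Approximate FLOPs per byte of KV cache read during decoding.

    Per cached token and KV head with head dimension d:
      FLOPs = 2*d for the scores plus 2*d for the weighted sum,
              for every query row that shares this KV head;
      bytes = d * bytes_per_elem per cached vector (2 untied, 1 tied).
    The head dimension d cancels, so it does not appear below.
    """
    flops_per_d = 4 * queries_per_kv_head * query_len
    cached_vectors = 1 if tied else 2
    bytes_per_d = cached_vectors * bytes_per_elem
    return flops_per_d / bytes_per_d


print("MHA, 1 query per KV head:  ", decode_intensity(1))
print("GQA, 4 queries per KV head:", decode_intensity(4))
print("GTA, 4 queries per group:  ", decode_intensity(4, tied=True))
print("GQA, 2 query tokens:       ", decode_intensity(4, query_len=2))
```

In this model the tied variant reads half the bytes for the same FLOPs, so its intensity is twice that of the untied grouped layout. Processing several query tokens per step, as in speculative decoding, raises intensity further because the same cached data is reused for each query token.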
Building on these results, the paper draws several key conclusions and suggests paths for future research:
- Maximizing arithmetic intensity and parallelism enables substantial advances in hardware efficiency for LLM inference.
- GTA offers an effective and memory-saving replacement for GQA, while GLA serves as a faster, scalable alternative to MLA—with both facilitating reduced latency and improved throughput on modern GPUs.
- Open-sourced kernels and techniques are directly usable for efficient LLM inference at scale; further work may extend these strategies to even larger models or investigate their synergy with emerging network architectures.